PEDS-1257 Add Difference class with diff #9
Conversation
with open(file_path, 'rb') as f:
    with PFBReader(f) as reader:
        for record in reader:
            submitter_id = record['object'].get('submitter_id')
The submitter_id is preserved across data versions only for the subject node; on other nodes it may change. The diff should therefore be made on all the values, with the attribute subjects.submitter_id serving as a stable reference to the subject node.
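A minimal sketch of one way to read this suggestion, not the PR's actual code. It assumes each record is a dict with a 'name' (node type) and an 'object' dict, that non-subject records carry a flattened 'subjects.submitter_id' attribute, and that a non-subject record's own submitter_id should be left out of the comparison because it is not stable. The helper names subject_ref, fingerprint, index_by_subject and changed_or_added are hypothetical.

```python
import json
from collections import defaultdict


def subject_ref(record):
    """Return the stable subject identifier a record belongs to (assumed layout)."""
    obj = record["object"]
    if record.get("name") == "subject":
        return obj.get("submitter_id")
    return obj.get("subjects.submitter_id")


def fingerprint(record):
    """Canonical string built from all values of the record's 'object'.

    For non-subject nodes the record's own submitter_id is dropped, on the
    assumption that it may change between data versions.
    """
    obj = dict(record["object"])
    if record.get("name") != "subject":
        obj.pop("submitter_id", None)
    return json.dumps(obj, sort_keys=True, default=str)


def index_by_subject(records):
    """Map (node name, subject reference) to the set of fingerprints seen."""
    index = defaultdict(set)
    for record in records:
        index[(record.get("name"), subject_ref(record))].add(fingerprint(record))
    return index


def changed_or_added(old_records, new_records):
    """Return new records whose values have no match under the same subject."""
    old_index = index_by_subject(old_records)
    return [
        record
        for record in new_records
        if fingerprint(record)
        not in old_index.get((record.get("name"), subject_ref(record)), set())
    ]
```

With this approach a sample whose submitter_id was regenerated but whose other values are unchanged would not be reported, while a sample with new values under the same subject would be.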
if submitter_id not in old:
    diff_records.append(new_record)
else:
    if new_record['object'] != old[submitter_id]['object']:
I would suggest not including the 'created_datetime' and 'updated_datetime' attributes within 'object' when checking to see if the record has changed.
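A short sketch of how that comparison might look, assuming records are dicts with an 'object' dict as in the snippet above. The names IGNORED_KEYS, comparable and record_changed are hypothetical and not part of the PR.

```python
# Attributes that change on every data release and should not count as a change.
IGNORED_KEYS = {"created_datetime", "updated_datetime"}


def comparable(obj):
    """Return a copy of a record's 'object' dict without the timestamp fields."""
    return {key: value for key, value in obj.items() if key not in IGNORED_KEYS}


def record_changed(new_record, old_record):
    """True if the records differ in anything other than the ignored timestamps."""
    return comparable(new_record["object"]) != comparable(old_record["object"])
```

The existing check `new_record['object'] != old[submitter_id]['object']` could then become `record_changed(new_record, old[submitter_id])`.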
for sid in sorted(deleted_submitter_ids):
    line = f" - {sid}"
    print(line)
    log_lines.append(line)
I think it would be useful to capture and output the removed records similar to the way 'diff_records' contains added/changed records.
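As a rough illustration, assuming old and new are dicts mapping a key (for example the submitter_id) to the full record, removed records could be collected alongside diff_records. The function name split_diff and the removed_records list are hypothetical.

```python
def split_diff(old, new):
    """Return (diff_records, removed_records) for two key -> record mappings.

    diff_records holds records that are new or whose 'object' changed;
    removed_records holds the full old records whose key no longer appears.
    """
    diff_records = [
        record
        for key, record in new.items()
        if key not in old or record["object"] != old[key]["object"]
    ]
    # Sort the keys so the output order is deterministic, as in the log loop.
    removed_records = [old[key] for key in sorted(old.keys() - new.keys())]
    return diff_records, removed_records
```

The removed records could then be written out in the same way as diff_records, rather than only logging their identifiers.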
sgchoe left a comment
Please see my line level comments regarding output of removed records and reconsidering how record changes are determined.
Created the Difference class in pfb_to_zip.py, which generates the difference between two datasets and writes it out as an Avro file. This functionality can be used from the command line with -d (diff.zip will be downloaded). A second file is also downloaded, containing some relevant console output (such as withdrawn consent indicating deleted subjects).